DSpin: Detecting Automatically Spun Content on the Web
Abstract
Web spam is an abusive search engine optimization technique that artificially boosts the search result rank of pages promoted in the spam content. A popular form of Web spam today relies upon automated spinning to avoid duplicate detection. Spinning replaces words or phrases in an input article to create new versions with vaguely similar meaning but sufficiently different appearance to avoid plagiarism detectors. With just a few clicks, spammers can use automated tools to spin a selected article thousands of times and then use posting services via proxies to spam the spun content on hundreds of target sites. The goal of this paper is to develop effective techniques to detect automatically spun content on the Web. Our approach is directly tied to the underlying mechanism used by automated spinning tools: we use a technique based upon immutables, words or phrases that spinning tools do not modify when generating spun content. We implement this immutable method in a tool called DSpin, which identifies automatically spun Web articles. We then apply DSpin to two data sets of crawled articles to study the extent to which spammers use automated spinning to create and post content, as well as their spamming behavior.
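The immutable idea described in the abstract can be illustrated with a minimal sketch. This is not the DSpin implementation: the function names, the toy synonym dictionary, and the use of Jaccard similarity over immutable sets are all assumptions made for illustration. The sketch drops every word that a (hypothetical) spinning dictionary could replace and compares what is left.

```python
import re

def immutables(text, spin_dictionary):
    """Return the set of words in `text` that the spinner would NOT modify.

    `spin_dictionary` is a hypothetical set of words the spinning tool
    knows synonyms for; anything outside it is treated as immutable.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in spin_dictionary}

def immutable_similarity(text_a, text_b, spin_dictionary):
    """Jaccard similarity of the two immutable sets (an assumed scoring choice)."""
    a = immutables(text_a, spin_dictionary)
    b = immutables(text_b, spin_dictionary)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy dictionary standing in for a real spinning tool's synonym list.
SPIN_DICTIONARY = {"big", "large", "huge", "quick", "fast", "rapid",
                   "buy", "purchase"}

original = "Buy a big laptop with a quick SSD"
spun = "Purchase a huge laptop with a rapid SSD"
unrelated = "Bake a large cake with fresh cream"

score_spun = immutable_similarity(original, spun, SPIN_DICTIONARY)
score_unrelated = immutable_similarity(original, unrelated, SPIN_DICTIONARY)
```

In this toy case the spun pair shares every immutable word, while the unrelated pair shares only common function words. A practical detector would also need a threshold and some handling of stop words, neither of which the abstract specifies.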
Similar resources
Analyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable attackers to hide and obfuscate infectious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
Automatically Detecting Members and Instrumentation of Music Bands Via Web Content Mining
In this paper, we present an approach to automatically detecting music band members and instrumentation using web content mining techniques. To this end, we combine a named entity detection method with a rule-based linguistic text analysis approach extended by a rule filtering step. We report on the results of different evaluation experiments carried out on two test collections of bands coverin...
Effect of Joule-Heating Annealing on Giant Magnetoimpedance of Co64Fe4Ni2B19-xSi8Cr3Alx (x = 0, 1 and 2) Melt-Spun Ribbons
In this work, we have studied the influence of dc joule-heating thermal processing on the structure, magnetoimpedance (MI) and thermal properties of Co64Fe4Ni2B19-xSi8Cr3Alx (x = 0, 1, and 2) rapidly solidified melt-spun ribbons. The nanocrystallization process was carried out by the current annealing of as-spun samples at various current densities. As-spun and joule-heated samples were studied...
Who Are We Listening to? Detecting User-generated Content (UGC) on the Web
The analysis of text-based user-generated content (UGC) on the Web has become one highly acclaimed topic in recent years both in theory and practice. As users are able to participate and publicly comment on almost any webpage nowadays, UGC occurs scattered across the web and mixes with various content types such as advertising texts, product descriptions or other editorial articles. Holistic re...
A Publishing System for Efficiently Creating Dynamic Web Content
This paper presents a publishing system for efficiently creating dynamic Web content. Complex Web pages are constructed from simpler fragments. Fragments may recursively embed other fragments. Relationships between Web pages and fragments are represented by object dependence graphs. We present algorithms for efficiently detecting and updating Web pages affected after one or more fragments change...